Final Project: Boston Weather Trend

This project tends to explore the weather trend of Boston during the past 82 years, from Jan. 1, 1936 to May 1, 2018.
Specifically, we will be seeing the following analysis in this project:
- Temperature Gradient of the past 82 years by month for horizontal comparison;
- Decomposed Time series of the average temperature during the past 82 years;
- Temperature Forecast for the future five years;
- Correlations between average temperature and other weather features.

I’ll also test the following hypothesis through this project:
1. The average temperature is gradually increasing overall;
2. The precipitation is positively correlated with the temperature;
3. The temperature is negatively correlated with the wind speed;
4. The amount of snowfall is negatively correlated with the temperature.

# install.packages("tidyverse")
# install.packages(c("RColorBrewer", "dplyr", "dygraphs", "ggfortify"))
#install.packages("ggfortify")
#install.packages("rmdformats")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ tidyr   0.8.0     ✔ stringr 1.3.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(RColorBrewer)
library(dplyr)
library(dygraphs)
library(ggfortify)
## Warning: package 'ggfortify' was built under R version 3.4.4
library(rmdformats)

0. Data Preparation

The data comes from National Center for Environmental Information: https://www.ncdc.noaa.gov.

#boston <- read.csv("https://www.ncei.noaa.gov/orders/cdo/1337764.csv")
#save(boston, file="boston.Rdata")

load("boston.Rdata")
datetxt <- boston$DATE
datetxt <- as.Date(datetxt)
boston <- add_column(boston,
                     YEAR = as.numeric(format(datetxt, format = "%Y")),
                     Month = as.numeric(format(datetxt, format = "%m")),
                     DAY = as.numeric(format(datetxt, format = "%d")), .after = 6) %>% 
  mutate(MONTH = month.abb[Month])
boston$month_ordered <- factor(boston$MONTH, levels = month.abb)

boston <- boston[,c("NAME","DATE","YEAR","Month","DAY","PRCP","SNOW","SNWD","TAVG","TMAX","TMIN","MONTH","month_ordered","AWND")]

1. Temperature Gradients

Average Temperature

ggplot(data=boston, aes(x=YEAR,y=month_ordered)) + 
  geom_tile(aes(fill = TAVG),colour = "white") + 
  scale_fill_gradientn(colours=rev(brewer.pal(10,'Spectral')), na.value = "grey98",
                       limits = c(-20, 100)) + 
  theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.title.x=element_blank(), plot.title = element_text(hjust = .5)) + 
  ggtitle("Average Temperature/ËšF of Boston from 1936-1-1 to 2018-5-1 (Based on Daily Average)")+
  scale_x_continuous(breaks=seq(1936,2018,5))

Max Temperature

ggplot(data=boston, aes(x=YEAR,y=month_ordered)) + 
  geom_tile(aes(fill = TMAX),colour = "white") + 
  scale_fill_gradientn(colours=rev(brewer.pal(10,'Spectral')), na.value = "grey98",
                       limits = c(-20, 100)) + 
  theme(legend.title=element_blank(),axis.title.y=element_blank(),axis.title.x=element_blank(),plot.title = element_text(hjust = .5)) + 
  ggtitle("Maximum Temperature/ËšF of Boston from 1936-1-1 to 2018-5-1 (Based on Daily Max)")+
  scale_x_continuous(breaks=seq(1936,2018,5))

Min Temperature

ggplot(data=boston, aes(x=YEAR,y=month_ordered)) + 
  geom_tile(aes(fill = TMIN),colour = "white") + 
  scale_fill_gradientn(colours=rev(brewer.pal(10,'Spectral')), na.value = "grey98",
                       limits = c(-20, 100)) + 
  theme(legend.title=element_blank(),axis.title.y=element_blank(),axis.title.x=element_blank(),plot.title = element_text(hjust = .5)) + 
  ggtitle("Minimum Temperature/ËšF of Boston from 1936-1-1 to 2018-5-1 (Based on Daily Min)")+
  scale_x_continuous(breaks=seq(1936,2018,5))

According to the temperature gradient maps above, temperature in June, July and August is obviously higher than other months, while December, January and February generally have the lowest temperature. December, 1962 stands out for its extradinarily low temperature.

Precipitation

ggplot(data=boston, aes(x=YEAR,y=month_ordered)) + 
  geom_tile(aes(fill = PRCP),colour = "white") + 
  scale_fill_gradientn(colours=brewer.pal(9,'BrBG'), na.value = "grey50",
                       limits = c(0, 7)) + 
  theme(legend.title=element_blank(),axis.title.y=element_blank(),axis.title.x=element_blank(),plot.title = element_text(hjust = .5)) + 
  ggtitle("Pricipitation/inch of Boston from 1936-1-1 to 2018-5-1 (Based on Daily PRCP)")+
  scale_x_continuous(breaks=seq(1936,2018,5))

As shown above, the overall precipitation in Boston is not too much. There are some months standing out with high precipitation, among which May, 1984 is the top 1 month with highest precipitation.

2. Time Series

monthly <- boston %>% select(month_ordered,YEAR,TAVG, PRCP, AWND, SNOW) %>% 
  group_by(month_ordered,YEAR) %>% 
  summarise(TAVG = mean(TAVG),PRCP = mean(PRCP), AWND = mean(AWND), SNOW = mean(SNOW)) %>% 
  arrange(YEAR)

Average Temperature

Overall time series

myts <- ts(monthly$TAVG,start=c(1936,1), end=c(2018,4), frequency=12)
ts.decompose <- decompose(myts)
autoplot(ts.decompose)+ggtitle("Time Sereis of Average Tempetation")
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 24 rows containing missing values (geom_path).

Observed

dygraph(ts.decompose$x) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)

Trend

dygraph(ts.decompose$trend) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)

Seasonal

dygraph(ts.decompose$seasonal) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)

Random

dygraph(ts.decompose$random) %>% dyRangeSelector() %>% dyOptions(drawGrid = F)

Although the observed time series shows a seasonal temperature change with regular pattern, the trend time series tells us that the overall temperature of Boston is in a rising trend.

Precipitation

ts_prcp <- ts(na.omit(monthly$PRCP),start=c(1936,1), end=c(2018,4), frequency=12)
autoplot(decompose(ts_prcp))+ggtitle("Time Sereis of Precipitation")
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Warning: Removed 24 rows containing missing values (geom_path).

The time series of precipitation demonstrates a less seasonal pattern. But it has several peaks with an interval of 30 years. For instance, the year of 1955 or os has a peak, and around 30 years later, precipitation reached another peak in 1983, which is followed by the next peak in 2010.

3. Forecast

hw <- HoltWinters(myts)
predict_temp <- predict(hw, n.ahead = 24, 
        prediction.interval = TRUE,
        level = as.numeric(0.95))

dygraph(predict_temp, main = "Predicted Average Temperature/Month") %>%
  dySeries(c("lwr", "fit", "upr"), label = "Temperature") %>%
  dyOptions(drawGrid = F) %>% 
  dyRangeSelector()

The forecast of the following 24 months follows a regular pattern, with the temperature peaking in July as around 75ËšF, and January will be having the lowest temperature 31 ËšF.

4. Correlation between Temperature and other parameters

pal <- colorRampPalette(c("blue", "yellow", "red", "green"))
mycols=pal(12)

Correlation between Temperature and Precipitation

ggplot(data=monthly,aes(x=TAVG,y=PRCP,color=factor(month_ordered))) + 
  geom_point(alpha=.5) + scale_color_manual(values=mycols) + 
  xlab('Temperature') + ylab('Precipitation')
## Warning: Removed 1 rows containing missing values (geom_point).

ggplot(data = monthly,mapping = aes(x=TAVG,y=PRCP)) + geom_point()+
  geom_smooth(method = "glm", method.args = list(family = "gaussian"))+ 
  xlab('Temperature') + ylab('Precipitation')
## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

cor.test(monthly$TAVG, monthly$PRCP, use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  monthly$TAVG and monthly$PRCP
## t = -3.8407, df = 986, p-value = 0.0001305
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.18239623 -0.05949065
## sample estimates:
##        cor 
## -0.1214088

From the above plots, we can come to the conclusion that the temperature is barely correlated to the precipitation of Boston. I also conduct correlation test on these two variables, the result of which shows that the r value is -0.121, a very weak negative relationship between these two variables. The precipitation among the 12 months has no obvious difference.

Correlation between Wind Speed and Temperature

ggplot(data=monthly,aes(x=AWND,y=TAVG,color=factor(month_ordered))) + 
  geom_point(alpha=.5) + scale_color_manual(values=mycols) + 
  xlab('Wind speed') + ylab('Temprature')
## Warning: Removed 576 rows containing missing values (geom_point).

ggplot(data = monthly,mapping = aes(x = AWND, y = TAVG)) + geom_point()+
  geom_smooth(method = "glm", method.args = list(family = "gaussian")) + 
  xlab('Wind speed') + ylab('Temprature')
## Warning: Removed 576 rows containing non-finite values (stat_smooth).

## Warning: Removed 576 rows containing missing values (geom_point).

cor.test(monthly$TAVG, monthly$AWND, use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  monthly$TAVG and monthly$AWND
## t = -16.233, df = 411, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6804793 -0.5624534
## sample estimates:
##        cor 
## -0.6250255

The above plots demonstrate that there is a strongly negative correlation between the temperature and wind speed in Boston. As the wind speed increases, the average temperature decreases. Correlation test is conducted on these two variables. the result shows that the r value is -0.625, a strong negative relationship between these two variables. Besides, the second plot shows an obvious difference in terms of the wind speed among different months. High wind speed are likely to concentrate in January and February, while June, July and August have relatively low wind speed.

Correlation between Snow and Temperature

ggplot(data=monthly,aes(x=TAVG,y=SNOW,color=factor(month_ordered))) + 
  geom_point(alpha=.5) + scale_color_manual(values=mycols) + 
  xlab('Temprature') + ylab('Snow')

ggplot(data = monthly,mapping = aes(x = TAVG, y = SNOW)) + geom_point()+
  geom_smooth(method = "glm", method.args = list(family = "gaussian"))+ 
  xlab('Temprature') + ylab('Snow')

cor.test(monthly$TAVG, monthly$SNOW, use = "complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  monthly$TAVG and monthly$SNOW
## t = -6.9185, df = 987, p-value = 8.2e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2737319 -0.1548030
## sample estimates:
##        cor 
## -0.2150647

From the above plots, we can come to the conclusion that the temperature is slightly negatively correlated to the amount of snowfall in Boston. As the temperature goes up, the amount of snowfall decreases. I also conduct correlation test on these two variables, the result of which shows that the r value is -0.215, a weak negative relationship between these two variables. Besides, the second plot shows that the snowfall in Boston concentrates in three months: December, January and February.

5. Regression

regression <- lm(data = boston, TAVG ~ AWND)
regression
## 
## Call:
## lm(formula = TAVG ~ AWND, data = boston)
## 
## Coefficients:
## (Intercept)         AWND  
##      64.955       -1.121
summary(regression)
## 
## Call:
## lm(formula = TAVG ~ AWND, data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -48.16 -12.91   0.31  13.32  40.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.9546     0.4797  135.40   <2e-16 ***
## AWND         -1.1210     0.0396  -28.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.63 on 12538 degrees of freedom
##   (17532 observations deleted due to missingness)
## Multiple R-squared:  0.06008,    Adjusted R-squared:  0.06001 
## F-statistic: 801.4 on 1 and 12538 DF,  p-value: < 2.2e-16
AIC(regression)
## [1] 106097.1
BIC(regression)
## [1] 106119.4
regression2 <- lm(data = boston, TAVG ~ AWND + PRCP)
regression2
## 
## Call:
## lm(formula = TAVG ~ AWND + PRCP, data = boston)
## 
## Coefficients:
## (Intercept)         AWND         PRCP  
##      65.097       -1.155        2.071
summary(regression2)
## 
## Call:
## lm(formula = TAVG ~ AWND + PRCP, data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47.85 -12.92   0.11  13.17  41.22 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 65.09705    0.48030 135.535  < 2e-16 ***
## AWND        -1.15494    0.04024 -28.698  < 2e-16 ***
## PRCP         2.07055    0.45642   4.536 5.77e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.62 on 12535 degrees of freedom
##   (17534 observations deleted due to missingness)
## Multiple R-squared:  0.06169,    Adjusted R-squared:  0.06154 
## F-statistic:   412 on 2 and 12535 DF,  p-value: < 2.2e-16
AIC(regression2)
## [1] 106060
BIC(regression2)
## [1] 106089.8
regression3 <- lm(data = boston, TAVG ~ AWND + PRCP + SNOW)
regression3
## 
## Call:
## lm(formula = TAVG ~ AWND + PRCP + SNOW, data = boston)
## 
## Coefficients:
## (Intercept)         AWND         PRCP         SNOW  
##      64.038       -1.041        3.840       -3.522
summary(regression3)
## 
## Call:
## lm(formula = TAVG ~ AWND + PRCP + SNOW, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.299 -12.665  -0.118  12.975  66.685 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  64.0380     0.4745 134.966   <2e-16 ***
## AWND         -1.0409     0.0399 -26.087   <2e-16 ***
## PRCP          3.8401     0.4560   8.421   <2e-16 ***
## SNOW         -3.5219     0.1654 -21.290   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.32 on 12534 degrees of freedom
##   (17534 observations deleted due to missingness)
## Multiple R-squared:  0.09444,    Adjusted R-squared:  0.09422 
## F-statistic: 435.7 on 3 and 12534 DF,  p-value: < 2.2e-16
AIC(regression3)
## [1] 105616.6
BIC(regression3)
## [1] 105653.8

The results of regression listed above tell us that all the parameters, namely wind speed, precipitation and snowfall, have a significant impact on the change of temperature in Boston. The limit of this project is that the data is not complete. For the future studies, I would recommend researchers to explore more weather parameters and their influence on the temperature.

Senhao Li